Partitioning-based clustering for Web document categorization
Authors
Abstract
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.
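For readers unfamiliar with the distance-based baseline the abstract mentions, the following is a minimal sketch of hierarchical agglomerative clustering over bag-of-words vectors. It is not the paper's graph-partitioning method. The function names (`cosine`, `agglomerate`), the choice of cosine similarity with average linkage, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(docs, k):
    """Merge the two most similar clusters (average link) until k remain.

    Returns a list of clusters, each a list of document indices.
    """
    # Illustrative assumption: whitespace tokenization, raw term counts.
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > max(k, 1):
        best = None  # (average similarity, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical toy corpus: two finance pages and two sports pages.
docs = [
    "stock market shares trading",
    "market shares investors trading",
    "football match goal score",
    "football league match score",
]
result = agglomerate(docs, 2)
print(result)
```

Every step of this scheme depends on a pre-specified similarity function over the full term space, which is the kind of requirement the abstract says the proposed techniques avoid.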
Similar resources
Web Document Clustering Using Threshold Selection Partitioning
Clustering techniques have been applied to categorize documents on the World Wide Web. In previous research, PDDP (Principal Direction Divisive Partitioning) is a well-known clustering algorithm. The PDDP algorithm employs top-down, unsupervised clustering based on principal component analysis and splits documents into two sets using a plane perpendicular to the maximum principal direction passi...
Analysis of Clustering Algorithms for Web-Based Search
Automatic document categorization plays a key role in the development of future interfaces for Web-based search. Clustering algorithms are considered as a technology that is capable of mastering this “ad-hoc” categorization task. This paper presents results of a comprehensive analysis of clustering algorithms in connection with document categorization. The contributions relate to exemplar-based...
Impact of Similarity Measures on Web-page Clustering
Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possi...
Data Mining Process Using Clustering : A Survey
Clustering is a basic and useful method for understanding and exploring a data set. Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Interest in clustering has increased recently in new areas of applications including data mining, bioinformatics, web mining...
Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of doc...
Journal title:
- Decision Support Systems
Volume 27, Issue
Pages -
Publication date 1999